fix: chunked BF16, buffer cap, drop fake FMA #55
Merged
1. `mul_add` with zero addend → plain multiply (was wasting an FMA slot).
2. Chunked row-batch reading for BF16 tensors: caps the buffer at 128 MB regardless of tensor size. A 10.7 GB `ffn_gate_exps` reads in ~4.8K-row batches instead of one 10.7 GB allocation. Minimum batch = 8 rows (one F64x8 SIMD width).
3. Buffer `shrink_to` after oversized tensors: `bf16_buf` is truncated back to `MAX_BUF_ELEMS` (64M u16 = 128 MB) if it somehow grew past it.
4. Progress logging within large tensors: prints the row count every chunk so you see activity during multi-minute tensor reads.

`read_tensor_bf16_raw()` is now unused in the main path (kept for potential direct use in tests or smaller models).
Addresses review items from the BF16-direct rebase:
Fix 1: `mul_add` with zero addend

`sums[bin].mul_add(splat(scale), splat(0.0))` → `sums[bin] * splat(scale)`

An FMA with a zero addend wastes the fuse: same latency as a plain multiply, but it occupies the FMA port instead of the multiply port.
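In scalar form, the change looks like the sketch below. The PR's version operates on F64x8 SIMD vectors via `splat`; the function names here are illustrative, and `f64::mul_add` stands in for the vector FMA.

```rust
// Before: fused multiply-add with a zero addend. The add contributes
// nothing, but the instruction still issues to the FMA unit.
fn scale_before(sum: f64, scale: f64) -> f64 {
    sum.mul_add(scale, 0.0)
}

// After: a plain multiply. Same numeric result and latency, but it can
// issue on the multiply port instead of occupying an FMA slot.
fn scale_after(sum: f64, scale: f64) -> f64 {
    sum * scale
}
```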
Fix 2+3: Chunked row-batch reading + buffer cap
Before: `read_tensor_bf16_raw` allocates the full tensor as a `Vec<u16>`. For `ffn_gate_exps` (128 experts × 5120 × 13824) that's 10.7 GB.

After: read in row batches capped at `MAX_BUF_ELEMS` (64M u16 = 128 MB). Each batch: read → project → extend results. The buffer never exceeds 128 MB regardless of tensor size. The largest tensor takes 142 chunks, each fully processed before the next read. Peak RAM stays at ~128 MB instead of 10.7 GB.
`shrink_to(MAX_BUF_ELEMS)` after oversized tensors prevents the buffer from persisting at an inflated size.
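A minimal sketch of the chunked loop under the constants the PR names. `read_rows_into` and `project_rows` are hypothetical stand-ins for the crate's actual reader and projection, and the buffer is passed in to reflect that the PR reuses it across tensors.

```rust
/// 128 MB cap on the staging buffer: 64M u16 (from the PR).
const MAX_BUF_ELEMS: usize = 64 * 1024 * 1024;
/// Minimum batch: 8 rows, one F64x8 SIMD width (from the PR).
const MIN_BATCH_ROWS: usize = 8;

fn read_tensor_chunked(bf16_buf: &mut Vec<u16>, n_rows: usize, row_elems: usize) {
    // Rows per batch so one batch fits under the cap, but never below the
    // 8-row SIMD minimum (which can exceed the cap for very wide rows).
    let batch_rows = (MAX_BUF_ELEMS / row_elems).max(MIN_BATCH_ROWS);
    let mut done = 0;
    while done < n_rows {
        let rows = batch_rows.min(n_rows - done);
        bf16_buf.resize(rows * row_elems, 0);
        // read_rows_into(bf16_buf, done, rows); // hypothetical: fill batch from disk
        // project_rows(bf16_buf, rows);         // hypothetical: process, extend results
        done += rows;
        // per-chunk progress line goes here (see "Bonus" below)
    }
    // Fix 3: if the 8-row minimum pushed a batch past the cap (rows wider
    // than 8M elems), truncate and release the excess capacity so the
    // reused buffer doesn't persist at inflated size.
    bf16_buf.truncate(MAX_BUF_ELEMS);
    bf16_buf.shrink_to(MAX_BUF_ELEMS);
}
```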
Bonus: progress logging

Large tensors now print `... 4629/655360 rows (0.7%)` per chunk, so you see activity during multi-minute reads.
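A sketch of the per-chunk progress line matching the format above; the variable names are assumptions, not the PR's.

```rust
// Prints e.g. "  ... 4629/655360 rows (0.7%)" once per chunk.
fn log_progress(done_rows: usize, total_rows: usize) {
    let pct = 100.0 * done_rows as f64 / total_rows as f64;
    eprintln!("  ... {done_rows}/{total_rows} rows ({pct:.1}%)");
}
```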
What's unused

`read_tensor_bf16_raw()` is no longer called in `stream_index_gguf_bf16` (replaced by inline chunked reads). Kept for potential test use.